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Abstract — Signal processing tasks as fundamental as sampling, 
reconstruction, minimum mean-square error interpolation and 
prediction can be viewed under the prism of reproducing kernel 
Hilbert spaces. Endowing this vantage point with contemporary 
advances in sparsity-aware modeling and processing, promotes 
the nonparametric basis pursuit advocated in this paper as 
the overarching framework for the confluence of kernel-based 
learning (KBL) approaches leveraging sparse linear regression, 
nuclear-norm regularization, and dictionary learning. The novel 
sparse KBL toolbox goes beyond translating sparse parametric 
approaches to their nonparametric counterparts, to incorporate 
new possibilities such as multi-kernel selection and matrix 
smoothing. The impact of sparse KBL to signal processing 
applications is illustrated through test cases from cognitive 
radio sensing, microarray data imputation, and network traffic 
prediction. 



L Introduction 

Reproducing kernel Hilbert spaces (RKHSs) provide an or- 
derly analytical framework for nonparametric regression, with 
the optimal kernel-based function estimate emerging as the 
solution of a regularized variational problem [|33l . The pivotal 
role of RKHS is further appreciated through its connections 
to "workhorse" signal processing tasks, such as the Nyquist- 
Shannon sampling and reconstruction result that involves sine 
kernels ll24l . Alternatively, spline kernels replace sine kernels, 
when smoothness rather than bandlimitedness is to be present 
in the underlying function space 131]. 

Kernel-based function estimation can be also seen from 
a Bayesian viewpoint. RKHS and linear minimum mean- 
square error (LMMSE) function estimators coincide when the 
pertinent co variance matrix equals the kernel Gram matrix. 
This equivalence has been leveraged in the context of field 
estimation, where spatial LMMSE estimation referred to as 
Kriging, is tantamount to two-dimensional RKHS interpolation 
ifTOll . Finally, RKHS based function estimators can linked 
with Gaussian processes (GPs) obtained upon defining their 
covariances via kernels [25 1. 

Yet another seemingly unrelated, but increasingly popular 
theme in contemporary statistical learning and signal process- 
ing, is that of matrix completion |[T2l . where data organized 
in a matrix can have missing entries due to e.g., limitations 
in the acquisition process. This article builds on the assertion 
that imputing missing entries amounts to interpolation, as in 
classical sampling theory, but with the low-rank constraint 
replacing that of bandlimitedness. From this point of view, 
RKHS interpolation emerges as the prudent framework for 



matrix completion that allows effective incorporation of a 
priori information via kernels p|, including sparsity attributes. 

Recent advances in sparse signal recovery and regression 
motivate a sparse kernel-based learning (KBL) redux, which 
is the purpose and core of the present paper. Building blocks 
of sparse signal processing include the (group) least- absolute 
shrinkage and selection operator (Lasso) and its weighted 
versions |[T6l , compressive sampling f^, and nuclear norm 
regularization [[T2 ll. The common denominator behind these 
operators is the sparsity on a signal's support that the £i- 
norm regularizer induces. Exploiting sparsity for KBL leads to 
several innovations regarding the selection of multiple kernels 
E3, O, additive modeling EH, lEll, collaborative filtering 
131 . matrix and tensor completion via dictionary learning JTl, 
as well as nonparametric basis selection [6|. In this context, 
the main contribution of this paper is a nonparametric basis 
pursuit (NBP) tool, unifying and advancing a number of sparse 
KBL approaches. 

Constrained by space limitations, a sample of applications 
stemming from such an encompassing analytical tool will be 
also delineated. Sparse KBL and its various forms contribute 
to computer vision ll28ll . ll32l . cognitive radio sensing ||6l, 
management of user preferences bioinformatics [|29l , 
econometrics ll2T1l . ll26l . and forecasting of electric prices, 
load, and renewables (e.g., wind speed) [18|, to name a few. 

The remainder of the paper is organized as follows. Section 
II reviews the theory of RKHS in connection with GPs, 
describing the Representer Theorem and the kernel trick, 
and presenting the Nyquist- Shannon Theorem (NST) as an 
example of KBL. Section III deals with sparse KBL including 
sparse additive models (SpAMs) and multiple kernel learning 
(MKL) as examples of additive nonparametric models. NBP 
is introduced in Section IV, with a basis expansion model cap- 
turing the general framework for sparse KBL. Blind versions 
of NBP for matrix completion and dictionary learning are 
developed in Sections V and VI. Finally, Section VII presents 
numerical tests using real and simulated data, including RE 
spectrum measurements, expression levels in yeast, and net- 
work traffic loads. Conclusions are drawn in Section VIII, 
while most technical details are deferred to the Appendix. 

II. KBL Preliminaries 

In this section, basic tools and approaches are reviewed to 
place known schemes for nonparametric (function) estimation 
under a common denominator. 
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A. RKHS and the Representer Theorem 

In the context of reproducing kernel Hilbert spaces (RKHS) 
1331 , nonparametric estimation of a function / : A' ^ R de- 
fined over a measurable space X is performed via interpolation 
of training points {(xi, ),..., (xat, zn)}^ where Xn ^ A', 
and Zn = f{xn) + en G M. For this purpose, a kernel function 
k \ XxX selected to be symmetric and positive definite, 
specifies a linear space of interpolating functions f{x) given 
by 

Hx '= |/(^) =^^^ank{xn,x) : an eR.Xn eX,n eN^ . 

For many choices of •), is exhaustive with respect to 
(w.r.t) families of functions obeying certain regularity condi- 
tions. The spline kernel for example, generates the Sobolev 
space of all low-curvature functions [11]. Likewise, the sine 
kernel gives rise to the space of bandlimited functions. Space 
Hx becomes a Hilbert space when equipped with the inner 
product < f,f >n;^'= E^n'=r^n<^n, </), and the 
associated norm is ||/||^;^ := \/< f^f >i-ix- A key result in 
this context is the so-termed Representer Theorem [33], which 
asserts that based on {(xn, ^n)}^=i, the optimal interpolator 
in Hx, in the sense of 

N 

f = arg min ^{z^ - f{xn)f + Mn. (D 
admits the finite-dimensional representation 

N 

f{x) = ^ank{xn,x). (2) 

n=l 

This result is nice in its simplicity, since functions in space 
H X are compound by a numerable but arbitrarily large number 
of kernels, while / is a combination of just a finite number of 
kernels around the training points. In addition, the regularizing 
term controls smoothness, and thus reduces overfit- 

ting. After substituting Q into ([B, the coefficients := 
[ai, . . . ,aAr] minimizing the regularized least-squares (LS) 
cost in ([T]) are given by a = (K + /il)~^z, upon recognizing 
that ll/lll^^ := a^Ka, and defining := [zi^ . . . ^ zn] as 
well as the kernel dependent Gram matrix K G R^^^ with 
entries Kn,n' •= k{xn,Xn') (-^ stands for transposition). 
Remark 1. The finite-dimensional expansion (O solves ([B 
for more general fitting costs and regularizing terms. In its 
general form, the Representer Theorem asserts that (0 is the 
solution 

N 

/ = arg min ^ £(^„, /(x„)) + M^^dl/lk;,) (3) 

where the loss function i{zm f{xn)) replacing the LS cost 
in ([T]) can be selected to serve either robustness (e.g., using 
the absolute- value instead of the square error); or, application 
dependent objectives (e.g., the Hinge loss to serve classifica- 
tion applications); or, for accommodating non-Gaussian noise 
models when viewing ^ from a Bayesian angle. On the other 
hand, the regularization term can be chosen as any increasing 
function ft of the norm ||/||^^, which will turn out to be 
crucial for introducing the notion of sparsity, as described in 
the ensuing sections. 



B. LMMSE, Kriging, and GPs 

Instead of the deterministic treatment of the previous sub- 
section, the unknown f{x) can be considered as a random 
process. The KBL estimate ^ offered by the Representer 
Theorem has been linked with the LMMSE-based estimator 
of random fields f{x), under the term Kriging ifTOl . To predict 
the value C = f{^) at an exploration point x via Kriging, the 
predictor f{x) is modeled as a linear combination of noisy 
samples Zn := f{xn)^r]{xn) at measurement points {xn}n^i\ 
that is, 

N 

f{x) = Y,^nZn=Z^0 (4) 

n=l 

where $^ := . . . , $n] are the expansion coefficients, and 
:= [zi^ . . . ^ Zn] collects the data. The MSE criterion is 
adopted to find the optimal $ := argmin/3 £^[/(x) — z^/3]^. 
Solving the latter yields $ = R"^ Tz^, where Rzz E[zz^] 
and Tz(^ := E[zf{x)]. If r]{x) is zero-mean white noise with 
power cr^, then Rzz and can be expressed in terms of the 
unobserved := [/(^i), • • • , /{xn)] as Rzz = Rcc + ^^I' 
where R(^<^ := E[CC^], and Tz^ = r<^^, with r^^^ := E[Cf{x)]. 
Hence, the LMMSE estimate in ^ takes the form 

N 

f{x) = z^(R<^<^ + a^I)"^r<^^ = ^ anr{x, Xn) (5) 

n=l 

where := z^(R(^<^ + cr^I)~^, and the n-th entry of r^^^, 
denoted by r{xn^x) := E[f{x)f{xn)], is indeed a function of 
the exploration point x, and the measurement point Xn- 

With the Kriging estimate given by ([5]), the RKHS and 
LMMSE estimates coincide when the kernel in Q is chosen 
equal to the covariance function r{x^x') in (O. 

The linearity assumption in is unnecessary when f{x) 
and e{x) are modeled as zero-mean GPs [|25l . GPs are those 
in which instances of the field at arbitrary points are jointly 
Gaussian. Zero-mean GPs are specified by co\{x^x') := 
E[f{x)f{x')], which determines the covariance matrix of 
any vector comprising instances of the field, and thus its 
specific zero-mean Gaussian distribution. In particular, the 
vector := [/(x), / (xi ),..., /(xat)] collecting the field at 
the exploration and measurement points is Gaussian, and so is 
the vector z^ := [/(x), /(xi) + 7^(xi), . . . , /(xat) +77(xiv)] = 
[C, z^]. Hence, the MMSE estimator, given by the expectation 
of f{x) conditioned on z, reduces to [il7i 

AT 

f{x) = E{f{x)\z) = z^R-^rl^ = ^anCOv(xn,x). (6) 

By comparing (|6]) with (|5]), one deduces that the MMSE 
estimator of a GP coincides with the LMMSE estimator, hence 
with the RKHS estimator, when cov(x,x') = k{x^x'). 

C. The kernel trick 

Analogous to the spectral decomposition of matrices. Mer- 
cer's Theorem establishes that if the symmetric positive def- 
inite kernel is square-integrable, it admits a possibly infinite 
eigenfunction decomposition k{x^x') = Yll!Li^i^i{^)^i{^') 
ll33l , with < ei{x)^ei'{x) >^^= 5i-i' where 5i stands 
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for Kronecker's delta. Using the weighted eigenf unctions 
(l)i{x) := \^Xiei{x)^ z G N, a point x G A' can be mapped to a 
vector (sequence) G such that = (j)i{x)^ i e N. This 
mapping interprets a kernel as an inner product in R^, since 
for two points x, G A', k{x^x') = Yll^i^'i{^)^i{^') '-— 
(f)^ {x)4>{x'). Such an inner product interpretation forms the 
basis for the "kernel trick." 

The kernel trick allows for approaches that depend on inner 
products of functions (given by infinite kernel expansions) 
to be recast and implemented using finite dimensional co- 
variance (kernel) matrices. A simple demonstration of this 
valuable property can be provided through kernel-based ridge 
regression. Starting from the standard ridge estimator ^ := 

argmin^gRZ,E!Li(^n " ^If^? + for e K^, 

and $ := [01, . . . , (/)Ar], it is possible to rewrite and solve 
/3 = argmin^^M- ||z-*^/3|p+^||/3f = ($$^+/iI)-i$z. 
After /3 is obtained in the training phase, it can be used for pre- 
diction of an ensuing zn+i = ^at+i/^ given 4>n+i- By using 
the matrix inversion lemma, ^at+i can be written as ^at+i = 
(1/m)0?;+i*z - + *^*)-i*^*z. 

Now, if (j)n = <P{xn) with I) = oo is constructed from 
G A' using eigenfunctions {(j)i{xn)}iZi^ then 0^^^^ = 
k^(^Ar+i) := [/c(xAr+i,Xi),...,/c(xAr+i,XAr)], and = 
K, which yields 

i^+i = (l//i)k^(x^+i)[I - (/i/ + K)-^K]z 

= k^(x,v+i)(/iI + K)-iz (7) 

coinciding with ©, (O, and with the solution of ([T]). 

Expressing a linear predictor in terms of inner products only 
is instrumental for mapping it into its kernel-based version. 
Although the mapping entails the eigenfunctions {(j)i{x)}, 
these are not explicitly present in (|7]), which is given solely 
in terms of k{x^x'). This is crucial since can be infinite 
dimensional which would render the method computationally 
intractable, and more importantly the explicit form of (j)i{x) 
may not be available. Use of kernel trick was demonstrated 
in the context of ridge regression. However, the trick can be 
used in any vectorial regression or classification method whose 
result can be expressed in terms of inner products only. One 
such example is offered by support vector machines, which 
find a kernel-based version of the optimal linear classifier 
in the sense of minimizing Vapnik's e-insensitive Hinge loss 
function, and can be shown equivalent to the Lasso |[T4t . 

In a nutshell, the kernel trick provides a means of designing 
KBL algorithms, both for nonparametric function estimation 
[cf. ([B], as well as for classification. 

D. KBL vis a vis Nyquist-Shannon Theorem 

Kernels can be clearly viewed as interpolating bases [cf. 
0] . This viewpoint can be further appreciated if one considers 
the family of bandlimited functions := {/ G C'^{X) : 
/ /(x)e-^^^dx = 0, V|cj| > tt}, where denotes the 
class of square-integrable functions defined over A' = R 
(e.g., continuous-time, finite-power signals). The family 
constitutes a linear space. Moreover, any / G can be 
generated as the linear combination (span) of sine functions; 
that is, f{x) = '^Zne'L /(^)sinc(x — n). This is the cornerstone 



of signal processing, namely the NST for sampling and 
reconstruction, but can be viewed also under the lens of RKHS 
with k{x^x') = sinc(x — x') as a reproducing kernel [l24l . 
The following properties (which are proved in the Appendix) 
elaborate further on this connection. 

PI. The sinc-kernel Gram matrix K G R^^^ satisfies K ^ 0. 
P2. The sine kernel decomposes over orthonormal eigenfunc- 
tions {0n(^) = sinc(x — n), n G Z}. 
P3. The RKHS norm is ||/|||^^ = / f{x)dx. 

PI states that sinc(x — x') qualifies as a kernel, while P2 
characterizes the eigenfunctions used in the kernel trick, and 
P3 shows that the RKHS norm is the restriction of the £^ 
norm to B^r- 

P1-P3 establish that the space of bandlimited functions B^^ 
is indeed an RKHS. Any f e B^^ can thus be decomposed 
as a numerable combination of eigenfunctions, where the 
coefficients and eigenfunctions obey the NST. Consequently, 
existence of eigenfunctions {(j)n{x)} spanning B^^ is a direct 
consequence of B^^ being a RKHS, and does not require the 
NST unless an explicit form for (x) is desired. Finally, strict 
adherence to NST requires an infinite number of samples to 
reconstruct f e B^^. Alternatively, the Representer Theorem 
fits / G to a finite set of (possibly noisy) samples by 
regularizing the power of /. 

HI. Sparse additive nonparametric modeling 

The account of sparse KBL methods begins with SpAMs 
and MKL approaches. Both model the function to be learned 
as a sparse sum of nonparametric components, and both rely 
on group Lasso to find it. The additive models considered in 
this section will naturally lend themselves to the general model 
for NBP introduced in Section IV, and used henceforth. 

A. SpAMs for High-Dimensional Models 

Additive function models offer a generalization of linear 
regression to the nonparametric setup, on the premise of 
dealing with the curse of dimensionality, which is inherent 
to learning from high dimensional data [161. 

Consider learning a multivariate function / : A' ^ R 
defined over the Cartesian product A' := A'l . . . 
of measurable spaces A'^. Let := [xi,...,xp] denote a 
point in X, ki the kernel defined over x A'^, and Hi its 
associated RKHS. Although /(x) can be interpolated from 
data via ([B after substituting x for x, the fidelity of Q is 
severely degraded in high dimensions. Indeed, the accuracy of 
depends on the availability of nearby points x^, where the 
function is fit to the (possibly noisy) data Zn- But proximity 
of points Xn in high dimensions is challenged by the curse of 
dimensionality, demanding an excessively large dataset. For 
instance, consider positioning N datapoints randomly in the 
hypercube [0, 1]^, repeatedly for P growing unbounded and 
N constant. Then limp^oo minn/n' E||xn — x^'H = 1; that 
is, the expected distance between any two points is equal to 
the side of the hypercube |16|. 

To overcome this problem, an additional modeling assump- 
tion is well motivated, namely constraining /(x) to the family 
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of separable functions of the form 

p 



(8) 



with Ci e Hi depending only on the i-th entry of x, as in 
e.g., linear regression models /linear(x) := Xl^^i A^i- With 
/(x) separable as in ([5]), the interpolation task is split into P 
one-dimensional problems that are not affected by the curse 
of dimensionality. 

The additive form in ^ is also amenable to subsect 
selection, which yields a SpAM. As in sparse linear regression, 
SpAMs involve functions / in ([8]) that can be expressed 
using only a few entries of x. Those can be learned using 
a variational version of the Lasso given by 



N 



arg mm 



1 ^ 



n=l 



(9) 



where J^p := {/ : A' ^ R : /(x) = ^ti c^{x,)}. 

With Xni denoting the i\h entry of x^, the Representer The- 
orem © can be applied per component Ci{xi) in ©, yielding 
kernel expansions Ci{xi) = Yl^^iJniki{xni^Xi) with scalar 
coefficients {7ni, i = 1, . . . , P, n = 1, . . . , N}. The fact 
that © yields a SpAM is demonstrated by substituting these 



expansions back into (|9]) and solving for 
to obtain 



[7. 



arg mm - 



'il, 



p 

'El 



(10) 

where is the Gram matrix associated with kernel ki, and 
IHlKi denotes the weighted ^2 -norm ||7i||Ki := (7^^Ki7i)^/^. 

B. Nonparametric Lasso 

Problem (fTOl) constitutes a weighted version of the group 
Lasso formulation for sparse linear regression. Its solution can 
be found either via block coordinate descent (BCD) [|26l . or 

1/2 

by substituting 7- = K/ 7^ and applying the alternating- 
direction method of multipliers (ADMM) JSI, with conver- 
gence guaranteed by its convexity and the separable structure 
of the its non-differentiable term [30|. In any case, group Lasso 
regularizes sub- vectors 7^ separately, effecting group- sparsity 
in the estimates; that is, some of the vectors 7^ in (fTOl) end up 
being identically zero. To gain intuition on this, (fTOl) can be 

1 /2 

rewritten using the change of variables K-^ 7i = ti^i, with 
ti > {) and ||ui|| = 1. It will be argued that if ji exceeds a 
threshold, then the optimal ti and thus 7^ will be null. Focusing 
on the minimization of (fTOl) w.r.t. a particular sub-vector 7^, as 
in a BCD algorithm, the substitute variables ti and should 
minimize 

1 1 /o 2 

-^ti (11) 



K/ tiWi 



where := z — ^j^i^jlj- Minimizing (fTTl) over is a 
convex univariate problem whose solution lies either at the 
border of the constraint, or, at a stationary point; that is. 



ti = max < 0, 



zFK 



1/2 



(12) 



The Cauchy-Schwarz inequality implies that zf K^^u^ < 

1/2 

||K;/ z^ll holds for any with ||ui|| = 1. Hence, it follows 
from ([12]) that if /i > ||Kj/^Zi||, then ti = 0, and thus 7^ = 0. 

The sparsifying effect of © on the additive model © is 
now revealed. If ji is selected large enough, some of the 
optimal sub-vectors 7^ will be null, and the corresponding 
functions Ci{xi) = Yln^ilm^ixni^Xi) will be identically 
zero in ([S]). Thus, estimation via Q provides a nonparametric 
counterpart of Lasso, offering the flexibility of selecting the 
most informative component-function regressors in the addi- 
tive model. 

The separable structure postulated in dS} facilitates subset 
selection in the nonparametric setup, and mitigates the problem 
of interpolating scattered data in high dimensions. However, 
such a model reduction may render dS} inaccurate, in which 
case extra components depending on two or more variables 
can be added, turning dU) into the ANOVA model II2TII . 

C. Multi-Kernel Learning 

Specifying the kernel that "shapes" and thus judi- 

ciously determines / in ([T]) is a prerequisite for KBL. Different 
candidate kernels /ci, . . . , /cp would produce different function 
estimates. Convex combinations can be also en^loyed in O, 
since elements of the convex hull IC \= {k = (^i^ii ^ 

0, X]iLi^i = l} conserve the defining properties of kernels. 

A data-driven strategy to select "the best" /c G /C is to 
incorporate the kernel as a variable in (|3), that is [|T9l 



N 

f = arg min y](^n - f{xn)f + m|| 



(13) 



where the notation 1-L% emphasizes dependence on k. 

Then, the following Lemma brings MKL to the ambit of 
sparse additive nonparametric models. 

Lemma 1 ([23]): Let {/ci, . . . , /cp} be a set of kernels and 
k an element of their convex hull JC. Denote by Hi and n% 
the RKHSs corresponding to ki and k, respectively, and by 
Hx the direct sum := Hi © . . . © Hp. It then holds that: 



a) n%=nx. yk e JC; and 

b) V /, inf{||/||^. : keJC} = min{Eti \\c^\\n. 



f = 



Ei^i^i, Ci e Ui}. 

According to Lemma [T] T-Lx can replace H^ in ([13]), 
rendering it equivalent to 



N 



f = arg min V(^n - /(^n))^ + M 
j^tix — ; 



P 

El 



(14) 



s. to {/ = Ci e Ui, Ux '= T-Li 



^np}. 



MKL as in (IT4l) resembles Q, differing in that components 
Ci{x) in (O depend on the same variable x. Taking into 
account this difference, JT4l) is reducible to dTOl ) and thus 
solvable via BCD or ADMoM, after substituting ki{xn-,x) for 
ki{xni,Xi). On the other hand, a more general case of MKL 
is presented in ||23]| , where JC is the convex hull of an infinite 
and possibly uncountable family of kernels. 
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An example of MKL applied to wireless communications 
is offered in Section IVIIl where two different kernels are 
employed for estimating path-loss and shadowing propagation 
effects in a cognitive radio sensing paradigm. 

In the ensuing section, basis functions depending on a 
second variable y will be incorporated to broaden the scope 
of the additive models just described. 

IV. NONPARAMETRIC BASIS PURSUIT 

Consider function f \ X x y ^ ^ over the Cartesian 
product of spaces X and y with associated RKHSs Hx and 
Hy, respectively. Let / abide to the bilinear expansion form 

p 

f{x,y) = J2c,{xMy) (15) 

where bi : y ^ R can be viewed as bases, and Ci : X ^ 
R as expansion coefficient functions. Given a finite number 
of training data, learning {ci^bi} under sparsity constraints 
constitutes the goal of the NBP approaches developed in the 
following sections. 

The first method for sparse KBL of / in ([T5]) is related 
to a nonparametric counterpart of basis pursuit, with the 
goal of fitting the function f{x^y) to data, where {bi} are 
prescribed and {ci}s are to be learned. The designer's degree 
of confidence on the modeling assumptions is key to deciding 
whether {bi}s should be prescribed or learned from data. 
If the prescribed {bi}s are unreliable, model (ITSl) will be 
inaccurate and the performance of KBL will suffer. But 
neglecting the prior knowledge conveyed by {bi}s may be also 
damaging. Parametric basis pursuit [9 1 hints toward addressing 
this tradeoff by offering a compromising alternative. 

A functional dependence z = f{y)-\-e between input y and 
output z is modeled in [9 1 with an overcomplete set of bases 
{bi{y)} (a.k.a. regressors) as 



^^Cibiiy) 



'A/'(0,a'). 



(16) 



Certainly, leveraging an overcomplete set of bases {bi{y)} 
can accommodate uncertainty. Practical merits of basis pursuit 
however, hinge on its capability to learn the few {bi}s that 
"best" explain the given data. 

The crux of NBP on the other hand, is to fit /(x, y) with a 
basis expansion over the y domain, but learn its dependence 
on X through nonparametric means. Model ([T5l l comes handy 
for this purpose, when {bi{y)}fLi is a generally overcomplete 
collection of prescribed bases. 

With {bi{y)}fLi known, {ci{x)}fLi need to be estimated, 
and a kernel-based strategy can be adopted to this end. 
Accordingly, the optimal function /(x, y) is searched over the 
family J^, := {f{x^y) = YliLi which constitutes 

the feasible set for the NBP-tailored nonparametric Lasso [cf. 
®] 



N 

arg min > (zn 

n=l 



f{Xn,yn)f 



P 

El 



(17) 



The Representer Theorem in its general form (|3) can be 
applied recursively to minimize (ITTI ) w.r.t. each Ci {x) at a time. 



rendering / expressible in terms of the kernel expansion as 
f{x,y) = Eili En=i7m^(^n,^)^i(^), whcrc coefficients 
ll •= [ill •>••••> Iin] are learned from data := [zi, . . . , zat] 
via group Lasso [cf. ([Tq1) 1 



mm 



l7i||K 



(18) 



with := T)mg[bi{yi), . . . , 6i(?/Ar)]K. 

As it was argued in Section III, group Lasso in ([TSl) 
effects group- sparsity in the subvectors {7i}^i. This property 
inherited by (fTTl) is the capability of selecting bases in the 
nonparametric setup. Indeed, by zeroing 7^ the corresponding 
coefficient function q(x) = Yln^ilin^i^nix) is driven to 
zero, and correspondingly bi{y) drops from the expansion ([T5l ). 
Remark 2. A single kernel kx and associated RKHS 1-Lx can 
be used for all components Ci{x) in (fTTl) , since the summands 
in ([T5]) are differentiated through the bases. Specifically, for 
a common K, a different bi{y) per coefficient Ci{x), yields 
a distinct diagonal matrix Diag [6^(7/1), . . . , 6i(^Ar)], defining 
an individual in (fTSl) that renders vector 7^ identifiable. 
This is a particular characteristic of dTTt . in contrast with (O 
and Lemma [T] which are designed for, and require, multiple 
kernels. 

Remark 3. The different sparse kernel-based approaches 
presented so far, namely SpAMs, MKL, and NBP, should not 
be viewed as competing but rather as complementary choices. 
Multiple kernels can be used in basis pursuit, and a separable 
model for Ci{x) may be due in high dimensions. An NBP- 
MKL hybrid applied to spectrum cartography illustrates this 
point in Section Ivnl where bases are utilized for the frequency 
domain y. 

V. Blind NBP for matrix and tensor completion 

A kernel-based matrix completion scheme will be developed 
in this section using a blind version of NBP, in which bases 
{bi} will not be prescribed, but they will be learned to- 
gether with coefficient functions {c^}. The matrix completion 
task entails imputation of missing entries of a data matrix 
Z G R^><^. Entries of an index matrix W G {0,1}^><^ 
specify whether datum Zmn is available {Wmn = 1), or missing 
(^mn = 0). Low rank of Z is a popular attribute that relates 
missing with available data, thus granting feasibility to the 
imputation task. Low-rank matrix imputation is achieved by 
solving 



arg min -||(Z 



A) W|||. s. to rank(A) < P 



(19) 



where stands for the Hadamard (element-wise) product. 
The low-rank constraint corresponds to an upperbound on the 
number of nonzero singular values of matrix A, as given 
by its ^o-norm. Specifically, if := [si, . . . , 5min{M,Ar}] 
denotes vector of singular values of A, and the cardinality 
\{si 0, i = l,...,min{M,7V}}| := ||s||o defines its 4- 
norm, then the ball of radius P, namely ||s||o < P, can replace 
rank(A) < P in ([T9l ). The feasible set ||s||o < P is not convex 
because ||s||o is not a proper norm (it lacks linearity), and 
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solving ([T9b requires a combinatorial search for the nonzero 
entries of s. A convex relaxation is thus well motivated. If the 
£o-norm is surrogated by the ^i-norm, the corresponding ball 
II s 111 < P becomes the convex hull of the original feasible set. 
As the singular values of A are non-negative by definition, it 
follows that II s 111 = ^^n{M,Ar} ^_ ^{yicq the sum of singular 
values equals the dual norm of the ^2-norm of A |5, p. 637], 
||s||i defines a norm over the matrix A itself, namely the 
nuclear norm of A, denoted by ||A||*. 

Upon substituting ||A||* for the rank, ([T9] ) is further trans- 
formed to its Lagrangian form by placing the constraint in the 
objective as a regularization term, i.e., 

Z = arg min ^ ||(Z - A) W||| + /i|| A||*. (20) 

The next step towards kernel-based matrix completion re- 
lies on an alternative definition of ||A||*. Consider bilinear 
factorizations of matrix A = CB^ with B G R^^^ and 
C G R^^^, in which the constraint rank(A) < P is implicit. 
The nuclear norm of A can be redefined as (see e.g., [i221 ) 



inf -( 

A^CBT 2 



|B||^ 



|C||^). 



(21) 



Result ([2TI) states that the infimum is attained by the singular 
value decomposition of A. Specifically, if A = U5]V^ with 
U and V unitary and X) := diag(s), and if B and C are 
selected as B = VX:V2, q ^ U5]V2, ^^^^ |(||B|||. + 

W^Wf) = E^d<5^ = ||A||*. Given (EB, it is possible to 
rewrite (l2Ql ) as 



Z = arg min - 1 

^A=CB^2' 



(Z-A)0W||^ 



(22) 

A formal proof of the equivalence between (l2Ql) and (l22l) can 
be found in ll22l . 

Matrix completion in its factorized form (|22]| can be 
reformulated in terms of (fTSl l and RKHSs. Following 131 , 
define spaces X := {1, . . . , M} and y := {1, . . . , TV} with 
associated kernels kx{m^ m') and ky{n^ n')^ respectively. Let 
f{m,n) represent the (m,n) -th entry of the approximant 
matrix A in (l22l) , and P a prescribed overestimate of its rank. 
Consider estimating / : A' x 3^ ^ R in ([T5]) over the family 

:= {/(m, n) = Yli=i Ci{n)bi{m), ci e Ux, hi e Uy} via 



^ M N 

f argmin- ^ ^Wmn{zmn - f{m,n)f 

m—l n=l 

+ f EdKllL + ll&illli,)- (23) 



If both kernels are selected as Kronecker delta functions, 
then (|23l) coincides with (1221) . This equivalence is stated in 
the following lemma. 

Lemma 2: Consider spaces X := {1,...,M}, y := 
{1,...,A^} and kernels kx{m^m') := 5{m — m') and 
ky{n^n') := 5{n — n') over the product spaces X x X and 
3^ X 3^, respectively. Define functions / : A* x 3^ ^ R, 
Q : A* ^ R, and 6^ : 3^ ^ R, z = 1, . . . , P, and matrices 
A G R^x^, B G R^><^, and C G R^^^. It holds that: 



a) RKHS Hx (3~Ly) of functions over X (correspondingly 
3^), associated with kx (ky) reduce to Hx = 

my = R^). 

b) Problems (|23]) , (l22l) . and (l2Ql) are equivalent upon identi- 
fying f{m,n) Aran, h{n) B^i, and ^(m) Cmi- 

According to Lemma [21 the intricacy of rewriting (l2Qt as in 
does not introduce any benefit when the kernel is selected 
as the Kronecker delta. But as it will be argued next, the 
equivalence between these two estimators generalizes nicely 
the matrix completion problem to sparse KBL of missing data 
with arbitrary kernels. 

The separable structure of the regularization term in 
enables a finite dimensional representation of functions 



M 



Ci{m) = ^ Jim'kx{m\m), m = l,...,M, 



N 



hi{n) = ^ I3in'ky{n ,n), n = l,...,N. 



(24) 



n' = l 



Optimal scalars {jim} and {Pin} are obtained by substitut- 
ing (l24l) into (I23]) , and solving 



. min -||(Z-K;,CB^K^)©Wrf 



trace(C^K;^C) +trace(B^K^B) 



(25) 



where matrix C (B) is formed with entries jrni (f^ni)- 

A Bayesian approach to kernel-based matrix completion is 
given next, followed by an algorithm to solve for B and C. 

A. Bayesian Low-Rank Imputation and Prediction 

To recast (1231) in a Bayesian framework, suppose that 
the available entries of Z obey the additive white Gaussian 
noise (AWGN) model Z = A + E, with E having entries 
independent identically distributed (i.i.d.) according to the 
zero-mean Gaussian distribution A/'(0,cr^). 

Matrix A is factorized as A = CB^ without loss of 
generality (w.l.o.g.). Then, a Gaussian prior is assumed for 
each of the columns and of B and C, respectively. 



b, -Ar(0,RB), c, -A/'(0,Rc) 



(26) 



independent across and with trace(RB) = trace(Rc). 
Invariance across i is justifiable, since columns are a priori 
interchangeable, while trace(RB) = trace(Rc) is introduced 
w.l.o.g. to remove the scalar ambiguity in A = CB^. 

Under the AWGN model, and with priors (|26] |, the maxi- 
mum a posteriori (MAP) estimator of A given Z at the entries 
indexed by W takes the form [cf. (|25]) 1 

min J||(Z-CB^)©Wf^ 



[trace(C^R^^C) +trace(B^R^^B)] . 



(27) 

With He = K;f and R^ = K^, and substituting B := 
KyB and C := KxC, the MAP estimator that solves (|27]) 
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Algorithm 1 : Kernel Matrix Completion (KMC) 
1: Initialize B and C randomly. 

2: Set the identity matrix Ip, with dimensions P x P, and columns 
e^, i = 1, . . . ,P 

3: while |cost — cost_old| < e do 

4: for i = 1 , . . . , P do 

5: Set Zi := Z - C(Ip - e^ef )B^ 

6: Compute := Diag[W(Bei Be^] + mK~^ 

7: Update column d = H^^W Zi)Bei 

8: end for 

9: for i = 1,. . . ,P do 
10: Set Z^ := Z - C(Ip - e^ef )B^ 

11: Compute := Diag[W^(Ce^ Ce^)] + /xK^^ 

12: Update column = H7^(W^ zf)Cei 

13: end for 

14: Recalculate cost = |||(Z - CB^) W\\l 

15: +f [trace(C^K-^C) + trace(B^K-^B)] 

16: end while 

17: return B = R-^B, C = K'^C, and Z = CB^ 



coincides with the estimator solving (1251) for the coefficients of 
kernel-based matrix completion, provided that covariance and 
Gram matrices coincide. From this Bayesian perspective, the 
KBL matrix completion method ([23l) provides a generalization 
of ([2Qb . which can accommodate a priori knowledge in the 
form of correlation across rows and columns of the incomplete 
Z. 

With prescribed correlation matrices and Rc, (l23l) can 
even perform smoothing and prediction. Indeed, if a column 
(or row) of Z is completely missing, ([23l) can still find an 
estimate Z relying on the covariance between the missing and 
available columns. This feature is not available with (l2Ql) , since 
the latter relies only on rank-induced colinearities, so it cannot 
reconstruct a missing column. The prediction capability is 
useful for instance in collaborative filtering [3 1, where a group 
of users rates a collection of items, to enable inference of new- 
user preferences or items entering the system. Additionally, the 
Bayesian reformulation (|27]) provides an explicit interpretation 
for the regularization parameter /i = as the variance of 
the model error, which can thus be obtained from training 
data. The kernel-based matrix completion method (l27l) is sum- 
marized in Algorithm [T] which solves (l27l) upon identifying 
Rc = K;f, Hb = K^, and cr^ = /i, and solves ([25l) after 
changing variables B := K^B and C := K;fC (compare 
(1251) with lines 13-14 in Algorithm [B. 

Detailed derivations of the updates in Algorithm [T] are 
provided in the Appendix. For a high-level description, the 
columns of B and C are updated cyclically, solving (1271) via 
BCD iterations. This procedure converges to a stationary point 
of ([27l) , which in principle does not guarantee global optimal- 
ity. Opportunely, it can be established that local minima of 
([27l) are global minima, by transforming (l27l l into a convex 
problem through the same change of variables proposed in 
II22II for the analysis of (l22l) . This observation implies that 
Algorithm [T] yields the global optimum of (1251) . and thus (1231) . 

The kernel-based matrix completion method here offers an 
alternative to O, where the low-rank constraint is introduced 
indirectly through the kernel trick. Furthermore, bypassing the 
nuclear norm and using (1211 instead, renders (|23) generalizable 
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Fig. 1. Comparison between KDL and NBP; (top) dictionary B and sparse 
coefficients 7m for KDL, where MNs equations are sufficient to recover C; 
(bottom) low-rank structure A = CB-^ presumed in KMC. 

to tensor imputation JTl. 

VI. Kernel-based dictionary learning 

Basis pursuit approaches advocate an overcomplete set of 
bases to cope with model uncertainty, thus learning from data 
the most concise subset of bases that represents the signal 
of interest. But the extensive set of candidate bases (a.k.a. 
dictionary) still needs to be prescribed. The next step towards 
model-agnostic KBL is to learn the dictionary from data, along 
with the sparse regression coefficients. Under the sparse linear 
model 

Zm = B'Jrn + 6,^, m = 1, . . . , M (28) 

with dictionary of bases B G ^ ^ , and vector of coefficients 
'jm G M^, the goal of dictionary learning is to obtain B and 
C := [71, ... , 7m]^ from data Z := [zi, . . . , zm]^- A swift 
count of equations and unknowns yields NP + MP scalar 
variables to be learned from MN data (see Fig. [T]). This goal 
is not plausible for an overcomplete design (P > N) unless 
sparsity of {7m}m=i exploited. Under proper conditions, 
it is possible to recover a sparse 7^ containing at most S 
nonzero entries from a reduced number Ng := OSlogP < N 
of equations ID, where 6> is a proportionality constant. Hence, 
the number of equations needed to specify C reduces to 
MNs, as represented by the darkened region of in Fig. 
[TJ With Ns < N, it is then possible and crucial to collect a 
sufficiently large number M of data vectors in order to ensure 
that MN > NP^MNs, thus accommodating the additional 
NP equations needed to determine B, and enable learning of 
the dictionary. 

Having collected sufficient training data, one possible ap- 
proach to find B and C is to fit the data via the LS cost 
II Z — CB^|||. regularized by the ^i-norm of C in order to 
effect sparsity in the coefficients ll20l . This dictionary leaning 
approach can be recast into the form of blind NBP (l23l) by in- 
troducing the additional regularizing term A^liLi IkiHi, with 
W^iWi '— Ylm=i The new regularizer on functions 

Ci : X depends on their values at the measurement points 
m only, and can be absorbed in the loss part of (|3]). Thus, the 
optimal {ci} and {bi} conserve their finite expansion repre- 
sentations dictated by the Representer Theorem. Coefficients 
{jmp^Pnp} must be adapted according to the new cost, and 
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dZTl) becomes 

min i||(Z - CB^) © W\\l + A||C||i (29) 

2 

+ y [trace(B^R-^B) +trace(C^R^^C)] . 

Remark 4. Kernel-based dictionary learning (KDL) via ( [29l) 
inherits two attractive properties of kernel matrix completion 
(KMC), that is blind NBP, namely its flexibility to introduce 
a priori information through and Rc, as well as the 
capability to cope with missing data. While both KDL and 
KMC estimate bases {bi} and coefficients {ci} jointly, their 
difference lies in the size of the dictionary. As in principal 
component analysis, KMC presumes a low-rank model for 
the approximant A = CB^, compressing signals {z^} with 
P' < M components (Fig. [T] (bottom)). Low rank of A is not 
required by the dictionary learning approach, where signals 
{z^} are spanned by P > M dictionary atoms {hi} (Fig. [T] 
(top)), provided that each z^ is composed by a few atoms 
only. 

Algorithm [T] can be modified to solve (l29l) by replacing the 
update for column in line 7 with the Lasso estimate 

Ci := arg min -c^H^c + c^(W ZjBe^ + A||c||i. (30) 

cGR^ 2 

The Bayesian interpretation of (|29] ) brings KDL close to 
1341 . where a Bernoulli-Gaussian model for C accounts for its 
sparsity, and a Beta distribution is introduced for learning the 
distribution of C through hyperparameters. Although [|34l as- 
sumes independent Gaussian variables across "time" samples 
in the underlying model for C, generalization to correlated 
variables is straightforward. Bernoulli parameters controlling 
the sparsity of Cmp are assumed invariant across m in 1341 , 
which amounts to stationarity over Cmp- 

Sparse learning of temporally correlated data is studied also 
in 1351 , although the time-invariant model for the support of 
Cm does not lend itself to dictionary learning. 

Although dictionary learning can indeed be viewed as a 
blind counterpart of compressive sampling, its capability of 
recovering B and C from data is typically illustrated by 
examples rather than theoretical guarantees. Recent efforts on 
establishing identifiability and local optimality of dictionary 
learning can be found in lfT3l and ifTSl . A related KDL strategy 
has been proposed in l|28l , where data and dictionary atoms 
are organized in classes, and the regularized learning criterion 
is designed to promote cohesion of atoms within a class. 

VIL Applications 

A. Spectrum cartography via NBP and MKL 

Consider the setup in |6| with Nc = 100 radios distributed 
over an area A' of 100 x lOOm^ to measure the ambient 
RF power spectral density (PSD) at Nf = 24 frequencies 
equally spaced in the band from 2,400MHz to 2,496MHz, 
as specified by IEEE 802.11 wireless LAN standard EJ. The 
radios collaborate by sharing their N = NcNf measurements 
with the goal of obtaining a map of the PSD across space 
and frequency, while specifying at the same time which of 



the P = 14 frequency sub-bands are occupied. The wireless 
propagation is simulated according to the pathloss model af- 
fected by shadowing described in JH, with parameters Up = 3, 
Ao = 60m, S = 25m , = 2bdB, and with AWGN variance 
cr^ = —lOdB. Fig. [21 depicts the distribution of power across 
space generated by two sources transmitting over bands i = 5 
and i = 8 with center frequencies 2, 432MHz and 2, 447MHz, 
respectively. Fig. [3] shows the PSD as seen by a representative 
radio located at the center of X. 




Fig. 2. Aggregate power distri- Fig. 3. PSD measurements at a 
bution across space. representative location Xn- 

Model (fTSl) is adopted for collaborative PSD sensing, with 
X and y representing the spatial and frequency variables, 
respectively. Bases {bi} are prescribed as Hann- windowed 
pulses in accordance with [2], and the distribution of power 
across space per sub-band is given by {q(x)} after interpo- 
lating the measurements obtained by the radios via ([TtI) . Two 
exponential kernels kr{x^x') = exp(— — x'|p/6>^), r = 1,2 
with Oi = 10m and O2 = 20m are selected, and convex 
combinations of the two are considered as candidate inter- 
polators k{x^ x'). This MKL strategy is intended for capturing 
two different levels of resolution as produced by pathloss 
and shadowing. Correspondingly, each Ci{x) is decomposed 
into two functions Cii{x) and Ci2{x) which are regularized 
separately in ([Tt]) . 

Solving (fTTl) generates the PSD maps of Fig. (H Only 75 and 
78 in the solution to (fTSl) take nonzero values (more precisely 
75^ and 78r, r = 1,2 in the MKL adaptation of (fTSl)). 
which correctly reveals which frequency bands are occupied 
as shown in Fig. |4] (first row). The estimated PSD across space 
is depicted in Fig. H] (second row) for each band respectively, 
and compared to the ground truth depicted in Fig. |4] (third 
row). The multi-resolution components c^r{x) and csr{x) are 
depicted in Fig. H] (last two rows), demonstrating how kernel 
ki captures the coarse pathloss distribution, while /c2 refines 
the map by revealing locations affected by shadowing. 

These results demonstrate the usefulness of model (ITSl) 
for collaborative spectrum sensing, with bases abiding to 
[2] and multi-resolution kernels. The sparse nonparametric 
estimator iVfl serves the purpose of revealing the occupied 
frequency bands, and capturing the PSD map across space 
per source. Compared to the spline-based approach in B, the 
MKL adaptation of (fTTl) here provides the appropriate multi- 
resolution capability to capture pathloss and shadowing effects 
when interpolating the data across space. 



IEEE SIGNAL PROCESSING MAGAZINE, 2013 (TO APPEAR) 



9 




Fig. 4. NBP for spectrum cartography using MKL. 

B. Completion of Gene Expression Data via Blind NBP 

The imputation method (|23]) is tested here on microarray 
data described in [27|. Expression levels of yeast across 
Ng = 4:^ 772 genes sampled at = 13 time points during 
the cell cycle are considered. A subset of M = 100 gene 
is extracted and their expression levels are organized in th 
matrix Z eR^^^ depicted in Fig. [5] (left). Severe data losse 
are simulated by discarding 90% of the entries of Z, includin 
the nearly 5% actually missing data. 

According to the Bayesian model (l26l) , it follows that 

E[ZZ^] = OKc + crll, E[Z^Z] = OKb + crll . (31 

To study the effect of hydrogen peroxide on the cell cycl 
arrest, two extra microarray datasets Z^^\ Z^^^ G R^^^ 
synchronized with Z, are collected in [i27i . These two matrice 
are employed to form an estimate of J5[ZZ^], which is use 
instead of Rc in (IZTl) after neglecting the noise term in (l3ll . 
Since the presence of hydrogen peroxide in samples Z^^^ and 
Z^^^ induces cell cycle arrest, the correlation between samples 
across time in Z^^^ and Z^^^ is altered, and thus these samples 



are not appropriate for estimating £^[Z^Z]. Alternatively, the 
sample estimate of ^[Z^Z] is formed with the microarray data 
of the [Ng — M) X TV genes set aside, and then used in place 
of Kb in (ED). 

Solving ^ with the available data (10% of the total) as 
shown in Fig. \5\ (second left) results in the matrix Z depicted 
in Fig. \5\ (second right), where the imputed missing data 
introduce an average recovery error of — 8dB [cf. Fig. [6l. 
In producing Z, the smoothing capability of ([23l) to recover 
completely missing rows of Z (amounting to 25 in this 
example) is corroborated. Missing rows cannot be recovered 
by nuclear norm regularization alone [cf. (120^1. even if Z is 
padded with expression levels of the discarded Ng — M genes. 
Fig. [5] (right) presents this case confirming that its performance 
dagrades w.r.t. NBP; while Fig. [6] illustrates the sensitivity 
of the estimation error to the cross-validated regularization 
parameter /i for both estimators. Similar degraded results 
are observed when imputing missing entries of Z using the 
impute. knn() and svdlmpute() methods, as implemented in 
the R packages pcaMethods and BioConductor-impute. These 
two methods were applied to the padded Z, after the requisite 
discarding of the 25 missing rows, resulting in recovery errors 
on the remaining missing entries at — 3.84dB and — 0.12dB 
(with parameter nPcs= 12), respectively. 




Fig. 5. Microarray data completion; from left to right: original sample; 10% 
available data; recovery via NBP; and recovery via nuclear-norm regularized 
LS. 




1^ 

Fig. 6. Relative recovery error in dB with 90% missing data; comparison 
between blind NBP (KMC) and nuclear norm regularization. 
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C. Network Flow Prediction via Blind NBP 

The Abilene network in Fig. [71 a.k.a. Internet 2, comprising 
11 nodes and M = 30 links [1|, is utilized as a testbed for 
traffic load prediction. Aggregate link loads Zmn are recorded 
every 5 minute intervals in the morning of December 22, 2008, 
between 12:00am and 11:55pm, and are collected in the first 
N/2 = 144 columns of matrix Z G R^^^. These samples are 
then used to predict link loads hours ahead, by capitalizing on 
their mutual cross-correlation, the periodic correlation across 
days, and their interdependence across links as dictated by the 
network topology. 

The correlation matrix E{ZZ^) represented in Fig. [8] is 
estimated with training samples collected during the two 
previous weeks, from December 8 to December 21, 2008, and 
substituted for He in (|27]) according to (|3T1) . A singular point 
at 11:00am in the traffic curve, as depicted in black in Fig. 
|9l is reflected in the sharp transition noticed in Fig. [8] On the 
other hand, is not estimated but derived from the network 
structure. Supposing i.i.d. flows across the network, it holds 
that E{Z^Z) = (TjR^R, where R represents the network 
routing matrix and a'j the flow variance. Thus, a^R^R, 
was used instead of R^ in (|27]) , with a'j adjusted to satisfy 
tr(E[Z^Z]) =tr(^[ZZ^]). 

Fig. [9] shows link loads predicted by (IZTt on December 
22, 2008, for a representative link, along with the actually 
recorded samples for that day. Prediction accuracy is compared 
in Fig. [9] to a base strategy comprising independent LMMSE 
estimators per link, which yield a relative prediction error Cp = 
0.22 aggregated across links, against Cp = 0.15 that results 
from (|27]) . Strong correlation among samples from 12:00am 
to 2:00pm [cf. Fig. [H renders LMMSE prediction accurate in 
this interval, relying on single-link data only. The benefit of 
considering the links jointly is appreciated in the subsequent 
interval from 2:00pm to 11:55pm, where the traffic correlation 
with morning samples fades away and the network structure 
comes to add valuable information, in the form of R^, to 
stabilize prediction. 




Fig. 7. Internet 2 network topology graph (T]. 



VIII. Summary 

A new methodology was outlined in this paper by cross 
fertilizing sparsity-aware signal processing tools with kernel- 
based learning. It goes well beyond translating sparse vector 
regression techniques into their nonparametric counterparts, to 
generate a series of unique possibilities such as kernel selec- 
tion or kernel-based matrix completion. The present article 




Fig. 8. Sample estimates of E{ZZ'^) for link loads across time, are used 
to replace Rc and K^;. 
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Fig. 9. Network prediction via KMC (blind NBP). Measured and predicted 
traffic on link m = 21. 

contributes to these efforts by advancing NBP as the corner- 
stone of sparse KBL, including blind versions that emerge 
as nonparametric nuclear norm regularization and dictionary 
learning. 

KBL was connected with GP analysis, promoting a 
Bayesian viewpoint where kernels convey prior information. 
Alternatively, KBL can be regarded as an interpolation toolset 
though its connection with the NST, suggesting that the impact 
of the prior model choice is attenuated when the size of 
the dataset is large, especially when kernel selection is also 
incorporated. 

All in all, sparse KBL was envisioned as a fruitful research 
direction. Its impact on signal processing practice was illus- 
trated through a diverse set of application paradigms. 

Appendix 
Proofs of Properties P1-P3 

Proof: 1) If white noise n{x) : x G M is fed to an 
ideal low-pass filter with cutoff frequency c^max = tt, then 
r(^) := E{z{x)z{x + ^)) = sinc(^) is the autocorrelation of 
the output z{x). Hence, K equals the covariance matrix of 
:= [z{xi)^ . . . , z{xn)], and as such K ^ 0. ■ 

Proof: 2) Rewrite the kernel fx'{x) := sinc(x — x') as 
a function parameterized by x' . Then, the NST applied to 
the bandlimited /^'{x) yields fx'{x) = J^lnez fx'{n)smc{x - 
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Proof: 3) Upon defining := f{xn), the reconstruction 
formula f{x) := ^^^^ /(n)sinc(x — n) gives the kernel 
expansion of / G . Hence, by definition of the RKHS norm 
WfWu;^ = EnGzEn'Gz/Hsinc(n-nO/(nO. Substituting 
the reconstructed /(n) = X]n'GZ^^^^(^ ~ ^0/(^0 i^^o the 
last equation yields ||/|||^^ = f^^^^ /^(n). ■ 

Design of Algorithm 1 

In order to rewrite the cost - CB^) W|||. + 

f [Tr(C^K-^C) +Tr(B^K-^B)] in terms = Ce^ and 
hi = Be^, representing the z-th columns of matrix B and 
C, respectively, define = C — c^e^^ and decompose 
CB^ = C^B^ + c^b^^. Then rewrite the cost as 

i||(Z, - cM^) will + |c,^K-ic, (32) 

after defining := Z — C^B^ and discarding regularization 
terms not depending on c^. 

Let vec(W) denote the vector operator that concatenates 
columns of W, and D := Diag[x] the diagonal matrix 
operator such that da = Xi. The Hadamard product can 
be bypassed by defining T>w •= Diag[vec(W)], substituting 
||X||f = ||vec(X)||2, and using the following identities 

vec(W0X) = DH^vec(X), 
vec(X,b,^) = (b, lM)vec(X,) (33) 

with representing the Kroneker product. Applying (l33l l to 
(l32l) yields 

i||DH^vec(Z,) - DH.(b, Im)c,||^ + |c,^K-^c, (34) 

Equating the gradient of (l34l) w.r.t. to zero, and solving 
for Ci it results 

Ci = H^^bi^ iM)D^vec(Zi) 

Hi := b^^ lM)J^w^w{hi^ ^ Im) + /iK-^ (35) 

It follows from (|33]) that (b^^ iM)DH^vec(Zi) = 
(W Zi), and it can be established by inspection that 
(b,^ lM)DH/DH^(b,^ Im) = En=i ^fnDiag[wJ = 
Diag[W(bi 0bi)], so that ^ reduces to 
c, = (Diag[W(b,0b,)]+/iK-^)"'(W ZOb„ 
coinciding with the update for in Algorithm 1. The 
corresponding update for b^ follows from parallel derivations. 

References 



[7] J. A. Bazerque, G. Mateos, and G. B. Giannakis, "Nonparametric low- 
rank tensor imputation," IEEE Workshop on Stat. Signal Proc, Ann Arbor, 
MI, Aug. 5-8, 2012. 

[8] E. J. Candes, and T. Tao, "Decoding by linear programming," IEEE Trans, 
on Info. Theory, vol. 51, no. 12, pp. 4203-4215, Dec. 2005. 

[9] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition 
by basis pursuit," SI AM J. Sci. Computing, vol. 20, no. 1, pp. 33-61, Dec. 
1998. 

[10] N. Cressie, Statistics for Spatial Data, Wiley, 1991. 

[11] J. Duchon, Splines Minimizing Rotation-Invariant Semi-norms in 

Sobolev Spaces, New York: Springer- Verlag, 1977. 
[12] M. Fazel, "Matrix rank minimization with applications" PhD Thesis, 

Electrical Engineering Dept., Stanford University, vol. 54, pp. 1-130, 

2002. 

[13] Q. Geng and J. Wright, "On the local correctness of -minimization 

for dictionary learning," IEEE Trans, on Info. Theory, 2011 (submitted); 

see arXiv: 1101.5672 /1 [cs.IT]. 
[14] F. Girosi, "An equivalence between sparse approximation and support 

vector machines," Neural Computation vol. 10, no. 6, pp. 1455-1480, 

Aug. 1998. 

[15] R. Gribonval and K. Schnass, "Dictionary identification - sparse matrix 

factorization via -minimization" IEEE Trans, on Info. Theory, vol. 56, 

no. 7, pp. 3523 - 3539, July 2010. 
[16] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical 

Learning, 2nd ed.. Springer, NY, 2009. 
[17] S. Kay, Fundamentals of Statistical Signal Processing, vol. 1, Prentice 

Hall, 2001. 

[18] V. Kekatos, S. Veeramachaneni, M. Light, and G. B. Giannakis, "Day- 
ahead electricity market forecasting using kernels," Proc. of lEEE-PES on 
Innovative Smart Grid Technologies, Washington, DC, Feb. 24-27, 2013. 

[19] V. Koltchinskii and M. Yuan, "Sparsity in multiple kernel learning," 
Annals of Statistics vol. 38, no. 6, pp. 3660-3695, Apr. 2010. 

[20] K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T. W. Lee, and T. 
J. Sejnowski, "Dictionary learning algorithms for sparse representation," 
Neural Computation, vol. 15, no. 2, pp. 349-396, Feb. 2003. 

[21] Y. Lin and H. H. Zhang, "Component selection and smoothing in 
multivariate nonparametric regression," Annals of Statistics, vol. 34, no. 
5, pp. 2272-2297, May 2006. 

[22] M. Mardani, G. Mateos, and G. B. Giannakis, "In-network sparsity- 
regularized rank minimization: Algorithms and applications," IEEE Trans, 
on Signal Proc, 2012; see also arXiv: 1203. 1507 /1 [cs.MA]. 

[23] C. Micchelli and M. Pontil, "Learning the kernel function via regular- 
ization," /. Machine Learning Res., vol. 6, pp. 1099-1125, Sep. 2005. 

[24] M. Z. Nashed and Q. Sun, "Function spaces for sampling expansions," 
Multiscale Signal Analysis and Modelling, edited by X. Shen and A. 
Zayed, Lecture Notes in EE, Springer, pp. 81-104, 2012. 

[25] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine 
Learning, the MIT Press, 2006. 

[26] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman, "Sparse additive 
models," /. Roy. Stat. Soc. B, vol. 71, no. 5, pp. 1009-1030, Oct. 2009. 

[27] M. Shapira, M. E. Segal, and D. Botstein, "Disruption of yeast forkhead- 
associated cell cycle transcription by oxidative stress," Molecular Biology 
of the Cell, vol. 15, no. 12, pp. 5659-5669, Dec. 2004. 

[28] A. Shrivastava, H. V. Nguyen, V. M. Patel, and R. Chellappa, "Design 
of non-linear discriminative dictionaries for image classification," Proc. 
of Asian Conf. on Computer Vision, Daejeon, Korea, 2012. 

[29] V. Sindhwani and A. C. Lozano, "Non-parametric group orthogonal 
matching pursuit for sparse learning with multiple kernels," Advances in 
Neural Information Processing Systems, pp. 2519-2527, Granada, Spain, 
2011. 

[30] P. Tseng and S. Yun, "A coordinate gradient descent method for 
nonsmooth separable minimization," /. Mathematical Programming, vol. 
117, no. 1-2, pp. 387-423, Mar. 2009. 

M. Unser, "Splines: A perfect fit for signal and image processing," IEEE 
Signal Proc. Magazine, vol. 16, no. 6, pp. 22-38, Nov. 1999. 

P. Vincent and Y. Bengio, "Kernel matching pursuit," Machine Learning, 
vol. 48, pp. 169-191, 2002. 

G. Wahba, Spline Models for Observational Data, Society for Industrial 
and AppHed Mathematics, PA 1990. 
Z. Xing, M. Zhou, A. Castrodad, G. Sapiro and L. Carin, "Dictionary 
learning for noisy and incomplete hyperspectral images," SIAM Journal 
on Imaging Sciences, vol. 5, no. 1, pp. 33-56, 2012. 
Z. Zhang, and B. D. Rao, "Sparse signal recovery with temporally 
correlated source vectors using sparse Bayesian learning," IEEE J. Sel. 
Topics in Signal Proc, vol. 5, no. 5, pp. 912-926, Sep. 2011. 



[1] [Onlme]. Available: http://mternet2.edu/observatory/archive/data-collections.html 
[2] IEEE Standard for Info. Tech.-Telecomms. and Info. Exchchange between [31 

Systems-Local and Metropolitan Area Nets., Part 11: Wir LAN MAC and 

PHY Specifications, IEEE Standard 802.11-2012, pp. 1-1184, Mar. 2012. [32 
[3] J. Abernethy, F. Bach, T. Evgeniou, and J. -P. Vert, "A new approach to 

collaborative filtering: Operator estimation with spectral regularization," [33 

/. Machine Learning Res., vol. 10, pp. 803-826, Mar. 2009. 
[4] P. Agrawal and N. Patwari, "Correlated link shadow fading in multihop [34 

wireless network," IEEE Trans, on Wireless Comm., vol. 8, no. 8, pp. 

4024-4036, Aug. 2009. 
[5] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge Univer- [35 

sity Press, 2004. 

[6] J. A. Bazerque, G. Mateos, and G. B. Giannakis, "Group-Lasso on splines 
for spectrum cartography," IEEE Trans, on Signal Proc, vol. 59, no. 10, 
pp. 4648-4663, Oct. 2011. 



